156 research outputs found
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
systems. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flips) on the fly during a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result even when one process failure has occurred. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly.
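The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal, sequential sketch. It is not the parallel subroutine described in the abstract: it uses small integer matrices and pure Python, and all function names (matmul, add_column_checksum, add_row_checksum, locate_error, correct) are hypothetical. A row of column sums is appended to A and a column of row sums to B, so their product carries checksums that locate and repair a single corrupted entry.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]

def locate_error(c_full):
    """Return (row, col, delta) for a single corrupted data entry, else None."""
    n = len(c_full) - 1          # number of data rows
    m = len(c_full[0]) - 1       # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        # This sketch only handles one wrong data entry.
        raise ValueError("cannot locate a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    delta = sum(c_full[i][:m]) - c_full[i][m]
    return i, j, delta

def correct(c_full):
    """Repair a single corrupted data entry in place; return what was found."""
    err = locate_error(c_full)
    if err is not None:
        i, j, delta = err
        c_full[i][j] -= delta
    return err

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
c_full[1][0] ^= 16               # simulate a bit-flip in one entry of the product
print(correct(c_full))           # location and size of the repaired error
```

Integers are used so the checksum comparison is exact; with floating-point data the equality tests would need a tolerance, which this sketch does not attempt.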
Implicit Actions and Non-blocking Failure Recovery with MPI
Scientific applications have long embraced MPI as the environment of
choice to execute on large distributed systems. The User-Level Failure
Mitigation (ULFM) specification extends the MPI standard to address resilience
and enable MPI applications to restore their communication capability after a
failure. This work builds upon the wide body of experience gained in the field
to eliminate a gap between current practice and the ideal, more asynchronous,
recovery model in which the fault tolerance activities of multiple components
can be carried out simultaneously and overlap. This work proposes to: (1)
provide the required consistency in fault reporting to applications (i.e.,
enable an application to assess the success of a computational phase without
incurring an unacceptable performance hit); (2) bring forward the building
blocks that permit the effective scoping of fault recovery in an application,
so that independent components in an application can recover without
interfering with each other, and separate groups of processes in the
application can recover independently or in unison; and (3) overlap recovery
activities necessary to restore the consistency of the system (e.g., eviction
of faulty processes from the communication group) with application recovery
activities (e.g., dataset restoration from checkpoints).
Comment: Accepted in FTXS'22 https://sites.google.com/view/ftxs202
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, giving them the opportunity to process and optimize its traversal
in order to maximize the algorithm's efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014
Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors
Factorization and multiplication of dense matrices and tensors are critical,
yet extremely expensive pieces of the scientific toolbox. Careful use of low
rank approximation can drastically reduce the computation and memory
requirements of these operations. In addition to a lower arithmetic complexity,
such methods can, by their structure, be designed to efficiently exploit modern
hardware architectures. The majority of existing work relies on batched BLAS
libraries to handle the computation of many small dense matrices. We show that
through careful analysis of the cache utilization, register accumulation using
SIMD registers and a redesign of the implementation, one can achieve
significantly higher throughput for these types of batched low-rank matrices
across a large range of block and batch sizes. We test our algorithm on 3 CPUs
using diverse ISAs -- the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148
using AVX-512 and AMD EPYC 7502 using AVX-2, and show that our new batching
methodology is able to obtain more than twice the throughput of vendor
optimized libraries for all CPU architectures and problem sizes.
Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver
The advent of multicore processors requires reconsidering the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard, where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of tasks in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used for algorithms with a regular data access pattern, and we show in this study that it can be efficiently applied to a highly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although the work is at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate.
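The contrast between inferred and explicitly encoded dependencies can be made concrete with a small sketch. It assumes a hypothetical three-front elimination tree and two hypothetical task kinds ("factor" and "assemble"); it does not use the API of any real runtime and only orders the tasks rather than executing numerical kernels. The dependencies of each task are written as a function of the task's own parameters, which is the general flavor of a parametrized task graph.

```python
from collections import deque

# Hypothetical elimination tree: fronts 1 and 2 are children of front 3.
PARENT = {1: 3, 2: 3, 3: None}

def predecessors(task):
    """Dependencies expressed as a function of the task's parameters."""
    kind = task[0]
    if kind == "factor":
        front = task[1]
        # A front can be factored once every child contribution is assembled.
        return [("assemble", c, front)
                for c, p in sorted(PARENT.items()) if p == front]
    if kind == "assemble":
        child = task[1]
        # A child's contribution exists only after the child is factored.
        return [("factor", child)]
    raise ValueError(kind)

def all_tasks():
    """Enumerate every task instance implied by the tree."""
    tasks = [("factor", f) for f in sorted(PARENT)]
    tasks += [("assemble", c, p)
              for c, p in sorted(PARENT.items()) if p is not None]
    return tasks

def schedule(tasks):
    """Return one execution order that respects the declared dependencies."""
    preds = {t: predecessors(t) for t in tasks}
    remaining = {t: len(preds[t]) for t in tasks}
    successors = {t: [] for t in tasks}
    for t in tasks:
        for p in preds[t]:
            successors[p].append(t)
    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)          # a real runtime would run the task body here
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    if len(order) != len(tasks):
        raise ValueError("cycle in task graph")
    return order

for task in schedule(all_tasks()):
    print(task)
```

Because the dependency rule is symbolic, the graph never has to be discovered by scanning data accesses; any process could evaluate predecessors for the tasks it owns, which is what makes this style attractive in a distributed setting.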
Coordinated Fault Tolerance for High-Performance Computing
Our work toward the goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance, and understanding how to design and implement interfaces for integrating fault tolerance features across multiple layers of the software stack, from the application, math libraries, and programming language runtime to other common system software such as job schedulers, resource managers, and monitoring tools.
Local Rollback for Resilient MPI Applications With Application-Level Checkpointing and Message Logging
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the ComPiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and, overall, the resilience impact on the applications.
This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2016-75845-P and the predoctoral grants of Nuria Losada ref. BES-2014-068066 and ref. EEBB-I-17-12005); by EU under the COST Program Action IC1305 Network for Sustainable Ultrascale Computing (NESUS) and a HiPEAC Collaboration Grant and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04). We gratefully thank Galicia Supercomputing Center for providing access to the FinisTerrae-II supercomputer.
This material is also based upon work supported by the US National Science Foundation, Office of Advanced Cyberinfrastructure, under Grants No. #1664142 and #1339763.
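The difference between global and local rollback can be sketched with a toy, single-threaded simulation. It assumes sender-side logging of point-to-point messages and two simulated processes; the Process class, its methods, and the replay helper are hypothetical and do not reflect the ULFM, CPPC, or VProtocol interfaces. Only the failed process returns to its checkpoint, and the survivor's log is replayed to bring it back to a consistent state.

```python
class Process:
    """Toy process whose state is the running sum of received payloads."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.received = {}       # sender name -> messages consumed so far
        self.sent_log = {}       # destination name -> payloads sent (the log)
        self.checkpoint = (0, {})

    def send(self, dest, payload):
        # Log the payload on the sender before delivering it.
        self.sent_log.setdefault(dest.name, []).append(payload)
        dest.deliver(self.name, payload)

    def deliver(self, sender_name, payload):
        self.state += payload
        self.received[sender_name] = self.received.get(sender_name, 0) + 1

    def take_checkpoint(self):
        self.checkpoint = (self.state, dict(self.received))

    def fail_and_restart(self):
        # Only this process rolls back; others keep their current state.
        state, received = self.checkpoint
        self.state, self.received = state, dict(received)

def replay(sender, receiver):
    """Re-deliver logged messages that the restarted receiver has lost."""
    already = receiver.received.get(sender.name, 0)
    for payload in sender.sent_log.get(receiver.name, [])[already:]:
        receiver.deliver(sender.name, payload)

a, b = Process("a"), Process("b")
a.send(b, 1)
a.send(b, 2)
b.take_checkpoint()
a.send(b, 3)
a.send(b, 4)
state_before_failure = b.state
b.fail_and_restart()             # b loses the effect of the last two messages
replay(a, b)                     # a did not roll back; it only replays its log
print(b.state == state_before_failure)
```

In this sketch the survivor repeats no computation, which is the saving the abstract attributes to local rollback; collective communications and log trimming are deliberately left out.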
A Multithreaded Communication Substrate for OpenSHMEM
OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication.